Overview

Subquestions:

1. What are the most frequent words used in popular songs?

4. Is there a pattern among successful songs?

5. Is there a relationship between the mood of songs and Twitter users’ attitudes?

Data Collection and Data Cleaning

Music Data

This project collected music information and Twitter posts from several APIs.

billboard.py, a Python API for accessing music charts from Billboard.com, was used to collect the song titles and artists’ names. With that information, PyLyrics, a Python module that retrieves song lyrics from lyrics.wikia.com, was used to fetch the lyrics.

The biggest data-cleaning problem in this part was the special characters in the titles and artist lists:

Graph 1(Tableau)
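The kind of cleaning described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `clean_title` and `split_artists`, the choice to drop parenthetical notes, and the list of artist separators are all assumptions.

```python
import re
import unicodedata
from typing import List

# Assumed separators between artists in a chart credit line (hypothetical list).
_ARTIST_SPLIT = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+|\s*[&,/]\s*", re.IGNORECASE)


def clean_title(title: str) -> str:
    """Strip accents and special characters so a title can serve as a lookup key."""
    # Map curly apostrophes to plain ones before filtering.
    text = title.replace("\u2019", "'")
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: parenthetical notes such as "(Remix)" are not part of the title.
    text = re.sub(r"\([^)]*\)", " ", text)
    # Keep letters, digits, apostrophes and spaces only.
    text = re.sub(r"[^A-Za-z0-9' ]+", " ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def split_artists(credit: str) -> List[str]:
    """Split a credit line such as 'A Featuring B & C' into individual artist names."""
    parts = _ARTIST_SPLIT.split(credit)
    return [part.strip() for part in parts if part and part.strip()]
```

For example, `split_artists("Rihanna Featuring Jay-Z & Kanye West")` returns three separate names, each of which can then be used in a lyrics lookup.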

The dataset also contains some songs in languages other than English:

Graph 2(matplotlib), Source Code
Graph 3(matplotlib), Source Code

We can see that very few songs use languages other than English. These songs were deleted so that the sentiment analysis would be more accurate.

Comparison of the data before and after cleaning:

Graph 4(Tableau)

Twitter Data

Twitter data cleaning

When cleaning the Twitter data, we first checked whether the same Twitter post link appeared more than once in our dataset; if it did, we removed the duplicate records. We then removed non-alphabetic symbols and hyperlinks from the text. The plot below provides statistics on the records we cleaned.

Graph 6(matplotlib), Source Code
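The two cleaning steps described above can be sketched as follows. The record layout (dictionaries with hypothetical `link` and `text` keys) and the decision to keep the first copy of each duplicated link are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]+")


def clean_text(text: str) -> str:
    """Remove hyperlinks first, then every non-alphabetic symbol."""
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    # Collapse the runs of whitespace created by the substitutions.
    return " ".join(text.split())


def clean_tweets(tweets: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop records whose link was already seen and clean the remaining text."""
    seen_links = set()
    cleaned = []
    for tweet in tweets:
        link = tweet["link"]  # hypothetical field name
        if link in seen_links:
            continue  # assumption: keep only the first copy of a duplicated link
        seen_links.add(link)
        cleaned.append({"link": link, "text": clean_text(tweet["text"])})
    return cleaned
```

Links are removed before the symbol filter runs; otherwise the filter would leave fragments such as "https" and "co" behind in the text.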

Twitter sentiment analysis

To generate a sentiment score for each tweet, we built our own sentiment analysis tool using Word2Vec and deep learning. We first obtained a training dataset from Sentiment140, which contains 1.4 million tweets labelled with sentiment scores. We then tokenized both the training data and our own Twitter data with the NLTK tokenizer and vectorized each token using a gensim Word2Vec model with vector size 512 and window size 10. Finally, we constructed a 7-layer convolutional neural network and trained it on the training data.

With 10-fold cross-validation, the model reached 75% accuracy. We then applied it to predict sentiment scores for the collected Twitter data, which yields the frequency distribution below:

Graph 7(matplotlib), Source Code
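The 10-fold cross-validation mentioned above can be sketched in plain Python. The CNN itself is not reproduced here; `fit_and_score` is a hypothetical callback that would train a model on the training indices and return its accuracy on the held-out indices.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples: int,
                   fit_and_score: Callable[[List[int], List[int]], float],
                   k: int = 10) -> float:
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Each sample is held out exactly once, so the averaged score reflects performance on data the model did not see during training in that fold.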

1. What are the most frequent words used in popular songs?

Let’s start with some EDA plots. In the first sub-question, our team simply wants to study which words are used most frequently in the lyrics. We first use a word cloud of all the songs in the Billboard Top 100 to study this.

Graph 8(matplotlib, Word Cloud), Source Code
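A word cloud is a visual form of a word-frequency table. The sketch below shows how such a table could be computed; the stop-word list is a small illustrative set, not the one used in the project, and `word_frequencies` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, Set

# Small illustrative stop-word list (an assumption; a real analysis would use a fuller one).
STOP_WORDS = {"a", "an", "and", "i", "in", "it", "me", "my", "of", "the", "to", "you"}

# A word is a run of letters, optionally followed by an apostrophe part ("don't").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_frequencies(lyrics_list: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> Counter:
    """Count how often each non-stop-word appears across all lyrics."""
    counts = Counter()
    for lyrics in lyrics_list:
        tokens = TOKEN_RE.findall(lyrics.lower())
        counts.update(token for token in tokens if token not in stop_words)
    return counts
```

Calling `word_frequencies(all_lyrics).most_common(20)` would list the twenty most frequent words, which are the ones drawn largest in a word cloud.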

Taylor Swift Example

## # A tibble: 10 x 2
##    track_title                  length
##    <fct>                         <int>
##  1 Sad Beautiful Tragic            183
##  2 A Perfectly Good Heart          224
##  3 The Outside                     227
##  4 State of Grace                  231
##  5 A Place In This World           232
##  6 Breathe (Ft. Colbie Caillat)    234
##  7 Cold as You                     242
##  8 Tied Together With A Smile      245
##  9 Invisible                       248
## 10 Come Back... Be Here            267

Graph 18(ggplot), Source Code

The average word count per track is close to 375, and the chart shows that most songs fall between 345 and 400 words. The density plot shows that the distribution is close to normal.

Graph 19(ggplot), Source Code

The album-wise word count shows that the latest album tends to have more words.

Graph 20(ggplot), Source Code

This plot shows a similar result. This suggests that, to be a superstar, you need to convey more information in your songs, which is how you influence your audience.

Graph 21(ggplot), Source Code

Overall, the most frequent mood in the songs is positive. We can also see that Taylor Swift has expressed all kinds of emotions in her songs; joy, anticipation and trust emerge as the top three.

Graph 22(ggplot), Source Code

Now that we have figured out the overall sentiment scores, we should find the top words that contribute to the various emotions and to positive/negative sentiment. The visualization shows that the word "bad" is predominant in emotions such as anger, disgust, sadness and fear, while surprise and trust are driven by the word "good".
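Finding the top words per emotion amounts to a lexicon lookup followed by counting. The sketch below uses a toy two-word lexicon built only from the associations named above; it is not the lexicon used in the analysis, and `top_words_per_emotion` is a hypothetical helper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Toy lexicon for illustration only: each word maps to the emotions it contributes to.
TOY_LEXICON: Dict[str, Set[str]] = {
    "bad": {"anger", "disgust", "sadness", "fear"},
    "good": {"surprise", "trust"},
}


def top_words_per_emotion(tokens: Iterable[str],
                          lexicon: Dict[str, Set[str]] = TOY_LEXICON,
                          n: int = 3) -> Dict[str, List[Tuple[str, int]]]:
    """Return the n most frequent contributing words for each emotion."""
    per_emotion = defaultdict(Counter)
    for token in tokens:
        # Words missing from the lexicon contribute to no emotion.
        for emotion in lexicon.get(token, ()):
            per_emotion[emotion][token] += 1
    return {emotion: counts.most_common(n) for emotion, counts in per_emotion.items()}
```

Because one word can belong to several emotions, a frequent word such as "bad" raises the score of every emotion it is listed under.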

Graph 23(ggplot), Source Code

We can see that joy has its largest share in 2010 and 2014. Overall, surprise, disgust and anger are the emotions with the lowest scores; however, compared with other years, 2017 has the largest contribution from disgust. For anticipation, 2010 and 2012 have higher contributions than other years.

5. Is there a relationship between the mood of songs and Twitter users’ attitudes?

Graph 24(Seaborn), Source Code

The graph above shows the number of favorites and retweets for the songs in each year. As we can see directly, there is a huge difference between the years before and after 2017. After 2017, the number of favorites and retweets increased dramatically. This suggests either that the songs on the chart had more influence after 2017 or that people became much more active on Twitter.

Graph 26(Seaborn), Source Code

The graph above shows the distribution of songs’ popularity in each year. As we can see, 2017 has the highest mean, which means most songs created in 2017 are popular. For 2008, however, although the maximum is very high, the mean is low, which means that year had both very popular songs and unpopular ones.

Graph 27(Seaborn), Source Code

The graph above shows the popularity of the songs in each month across these years. We expected to see a trend in the changing colors of the graph. The result shows a slight pattern of increasing popularity from 2008 to around 2013; after that, the monthly popularity of the songs becomes unpredictable.

5. The fifth sub-question is whether there is a relationship between the sentiment analysis results from the tweets and from the lyrics. In other words, the question becomes “Do the lyrics’ attitudes influence people’s comments about them on Twitter?” and “Does the connection between the two attitudes influence the popularity of the songs?”

To answer the questions above, we have to look at the results of the sentiment analysis of the tweets and the lyrics.

Graph 28(Seaborn), Source Code

The graph above shows a two-dimensional plot of the sentiment analysis attitude scores. There appears to be no linear relationship between the tweets’ attitude and the lyrics’ attitude, but we cannot be sure until we run some tests. Still, the graph gives us one interesting piece of information: most points are located at the top and bottom of the graph, which means that songs that are very negative or very positive get more attention on Twitter than neutral songs.

Next, we look at each individual analysis result.

Graph 29(Seaborn), Source Code

As the boxplots above show, the lyrics have much higher attitude scores than the tweets. Most tweets are below neutral, which means they are more negative, while most lyrics are more positive. However, we still cannot be 100% sure that the difference is significant, so we run a one-way ANOVA test to check.

Hypothesis test: one-way ANOVA

Hypothesis1, Source Code

Since the p-value shown above is below 0.05, we reject the null hypothesis: there is a significant difference between the lyrics’ attitude scores and the tweets’ attitude scores. Knowing that they differ, we move on to the next step and test whether there is a relationship between them.
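For reference, the F statistic behind a one-way ANOVA can be computed by hand as below. This is a sketch of the standard formula, not the project's source code; it stops at the F statistic and its degrees of freedom, since the p-value requires the F distribution (a statistics library such as SciPy provides this through `scipy.stats.f_oneway`).

```python
from typing import Sequence, Tuple


def one_way_anova_f(*groups: Sequence[float]) -> Tuple[float, int, int]:
    """Return (F statistic, between-group df, within-group df) for the given groups."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for x in group)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With two groups, lyric scores and tweet scores, a large F value means the gap between the group means is large relative to the spread inside each group.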

Hypothesis test: linear regression

Hypothesis2, Source Code
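A simple linear regression of one attitude score on the other can be sketched with ordinary least squares. The function below is an illustration under that assumption rather than the project's code; it returns the slope, the intercept and the Pearson correlation coefficient r.

```python
import math
from typing import Sequence, Tuple


def simple_linear_regression(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x by least squares and return (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

An r close to zero would agree with the scatter plot's impression that the tweets' attitude and the lyrics' attitude are not linearly related, while a value near 1 or -1 would indicate a strong linear relationship.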